{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LinearRegression线性回归"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 代价函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$J(\\theta)=\\frac{1}{2m}\\sum_{i=1}^{m}(h_\\theta(x^i)-y^i)^2$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 其中，$h(\\theta)=\\theta_0+\\theta_1x_1+\\theta_2x_2+...$。现在我们就是要求出$\\theta$，使得代价函数最小，即代表我们拟合出来的数据离真实值最近。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 共有m个样本，$(h_\\theta(x^i-y^i)^2$代表拟合出的值到真实值的差的平方，平方的原因是可能有负值，正负值会抵消。前面的系数2在求梯度下降求偏导的时候会约去。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 实现代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 计算代价函数\n",
    "def loss(X,y,theta):#这里的X是真实数据前面加了一列，因为有theta0\n",
    "    m = len(y)\n",
    "    loss = 0\n",
    "    \n",
    "    loss = np.sum((np.dot(X,theta) - y)**2) / (2 * m)\n",
    "    return loss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 梯度下降算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 代价函数对$\\theta_j$求偏导，可以得到"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\frac{\\partial J(\\theta)}{\\partial \\theta_j}=\\frac{1}{m}\\sum_{i=1}^{m}[(h_\\theta(x^i) - y^i)x_j^i]$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 所以对$\\theta$的更新可以写为："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\theta_j = \\theta_j-\\alpha \\frac{1}{m}\\sum_{i=1}^{m}[(h_\\theta(x^i) - y^i)x_j^i]$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 其中，$\\alpha $是学习率，代表每次梯度下降的速度，一般为0.01,0.03,0.1，..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 梯度下降算法是机器学习中常用的算法，计算参数的方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 实现代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#梯度下降\n",
    "def grandientDescent(X,y,theta,alpha,num_iters):\n",
    "    m = len(y)\n",
    "    \n",
    "    J_history = np.zeros(num_iters)#存放每次迭代的loss的值，[num_iters x 1]\n",
    "    \n",
    "    for i in range(num_iters):\n",
    "        Z = X.T.dot(X.dot(theta) - y) * alpha / m#更新参数\n",
    "        theta = theta - Z\n",
    "        J_history[i] = loss(X,y,theta)#计算损失\n",
    "    return theta,J_hiistory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 均值归一化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 目的是使数据缩放到一个范围内，便于使用梯度下降法。比如房子的面积是1200平方米，房间数是5，2个特征值差别较大，需要对其进行归一化。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 归一化公式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$x_i=\\frac{x_i-\\mu _i}{s_i}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$x_i$是每一个样本数据，$\\mu _i$是平均值，$s_i$可以是标准差，也可以是最大值-最小值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 实现代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#归一化feature\n",
    "def featureNormaliza(X):\n",
    "    X_norm = np.array(X)#将X转化为numpy数组，进行矩阵运算\n",
    "    #定义变量\n",
    "    mu = np.zeros((1,X.shape[1]))#(1 x X.shape[1]) \n",
    "    sigma = np.zeros((1,X.shape[1]))\n",
    "    \n",
    "    mu = np.mean(X_norm,0)#计算每一列的平均值，0代表列\n",
    "    sigma = np.std(X_norm,0)#求每一列的标准差\n",
    "    \n",
    "    for i in range(X.shape[1]):#遍历列\n",
    "        X_norm[:,i] = (X_norm[:,i] - mu[i]) / sigma[i]#归一化\n",
    "        \n",
    "    return X_norm,mu,sigma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 不仅要对训练数据进行归一化，而且要对测试数据归一化。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用scikit-learn库中的线性模型实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 导入包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import linear_model\n",
    "from sklearn.preprocessing import StandardScaler#特征缩放"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 归一化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "scaler = StandardScaler()\n",
    "scaler.fit(X)\n",
    "x_train = scaler.transform(X)\n",
    "x_tets = scaler.transform(np.array([1650,3]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 线性模型拟合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = linear_model.LinearRegression()\n",
    "model.fit(x_train,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 预测"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = model.predict(x_test)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
